Multi-Modal Data Augmentation for End-to-End ASR

Authors

  • Adithya Renduchintala
  • Shuoyang Ding
  • Matthew Wiesner
  • Shinji Watanabe
Abstract

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders, one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input. The MMDA architecture attempts to eliminate the need for an external language model (LM) by enabling seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER) and an 8-10% relative word error rate (WER) improvement on the WSJ dataset.
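The abstract sketches the core architectural idea: two modality-specific encoders whose outputs feed a single decoder with shared attention and output parameters. Below is a minimal PyTorch sketch of that weight-sharing pattern; the encoder types, layer sizes, and vocabulary sizes are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MMDA(nn.Module):
    """Two encoders (acoustic and symbolic) sharing one attention decoder."""

    def __init__(self, n_feats=83, n_symbols=52, n_chars=30, hidden=320):
        super().__init__()
        # Modality-specific encoders (sizes are illustrative).
        self.acoustic_enc = nn.LSTM(n_feats, hidden, num_layers=2,
                                    batch_first=True, bidirectional=True)
        self.symbol_embed = nn.Embedding(n_symbols, hidden)
        self.symbolic_enc = nn.LSTM(hidden, hidden, num_layers=1,
                                    batch_first=True, bidirectional=True)
        # Attention, decoder, and output layer are shared by both modalities.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)
        self.char_embed = nn.Embedding(n_chars, 2 * hidden)
        self.decoder = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, x, prev_chars, modality):
        # Route the input through the encoder that matches its modality.
        if modality == "acoustic":
            enc, _ = self.acoustic_enc(x)                     # x: (B, T, n_feats)
        else:
            enc, _ = self.symbolic_enc(self.symbol_embed(x))  # x: (B, T) int ids
        # The shared decoder attends over encoder states regardless of modality.
        dec, _ = self.decoder(self.char_embed(prev_chars))
        ctx, _ = self.attn(dec, enc, enc)
        return self.out(ctx + dec)                            # (B, U, n_chars)

model = MMDA()
speech = torch.randn(4, 200, 83)                 # acoustic feature frames
text = torch.randint(0, 52, (4, 40))             # symbolic (e.g. phoneme) ids
prev = torch.randint(0, 30, (4, 25))             # previous output characters
speech_logits = model(speech, prev, "acoustic")
text_logits = model(text, prev, "symbolic")

During training, batches of transcribed speech and batches of symbolic sequences derived from a large text corpus would be interleaved, so the shared attention and decoder parameters are updated by both modalities while each encoder sees only its own input type.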

Similar Articles

Capacitated Single Allocation P-Hub Covering Problem in Multi-modal Network Using Tabu Search

Hub location problems aim to find the locations of hub facilities and to determine the allocation of non-hub nodes to the located hubs. In this work, we discuss the multi-modal single allocation capacitated p-hub covering problem over fully interconnected hub networks and provide a formulation for it. The purpose of our model is to find the location of hubs and the ...

Learning Visual Reasoning Without Strong Priors

Achieving artificial visual reasoning (the ability to answer image-related questions which require a multi-step, high-level process) is an important step towards artificial general intelligence. This multi-modal task requires learning a question-dependent, structured reasoning process over images from language. Standard deep learning approaches tend to exploit biases in the data rather than le...

Assistive Robot Multi-modal Interaction with Augmented 3D Vision and Dialogue

This paper presents a multi-modal interface for interaction between people with physical disabilities and an assistive robot. This interaction is performed through a dialogue mechanism and augmented 3D vision glasses that provide visual assistance to an end user commanding the robot to perform Daily Life Activities (DLAs). The augmented 3D vision glasses may provide augmented reality vis...

Multi-Modal Multi-Task Deep Learning for Autonomous Driving

Several deep learning approaches have been applied to the autonomous driving task, many employing end-to-end deep neural networks. Autonomous driving is complex, utilizing multiple behavioral modalities ranging from lane changing to turning and stopping. However, most existing approaches do not factor the different behavioral modalities of the driving task into the training strategy. This pap...

Matching the Acoustic Model to Front-End Signal Processing for ASR in Noisy and Reverberant Environments

Distant-talking automatic speech recognition (ASR) represents an extremely challenging task. The major reason is that unwanted additive interference and reverberation are picked up by the microphones in addition to the desired signal. A hands-free human-machine interface should therefore comprise a powerful acoustic preprocessing unit together with a robust ASR back-end. However, since perfect speech e...

Publication date: 2018